Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix slowdown after generating backslash escape sequences #91

Closed
wants to merge 2 commits into from

Conversation

AriX
Copy link
Contributor

@AriX AriX commented Apr 19, 2024

Addresses an issue I've seen which I believe is the same as what's reported in #90: an edge case where JSONSchemaParser's shortcut_key implementation did not consider the correct parser while finishing parsing an escape sequence. When the UnionParser that StringParsingState pops onto the stack after getting a BACKSLASH is finished but not yet removed from the stack, shortcut_key fails to return json_freetext.

@noamgat
Copy link
Owner

noamgat commented Apr 19, 2024

OK, now I think I understand the problem you reported and the solution you proposed. However I think your specific solution may create illegal outputs, as if the topmost parser still has potential characters that can be inserted, we are not necessarily returning the correct results.
I think the safe thing to do, which will also fix the performance problem, is to check (iteratively) if the topmost parser is can_end=True and get_allowed_characters() is empty - because that means that it is effectively done and will be ignored.

Better yet, this check should happen not inside the shortcut key, but during the jsonschemaparser itself - if a member in the stack is in can_end=True and get_allowed_characters()=empty, it shouldn't be there anymore because we know what will happen next time. I think this is the correct solution for this problem.
If you want, you can implement it that way. If not, I will try to get to it.

@AriX
Copy link
Contributor Author

AriX commented Apr 19, 2024

I think the safe thing to do, which will also fix the performance problem, is to check (iteratively) if the topmost parser is can_end=True and get_allowed_characters() is empty - because that means that it is effectively done and will be ignored.

Good point! I just implemented this.

Better yet, this check should happen not inside the shortcut key, but during the jsonschemaparser itself - if a member in the stack is in can_end=True and get_allowed_characters()=empty, it shouldn't be there anymore because we know what will happen next time. I think this is the correct solution for this problem.

That sounds very reasonable but I'm not sure exactly where in the lifecycle of JSONSchemaParser you had mind for this check to happen.

For now, I implemented the former change. If you want to give more specific guidance I'm happy to try the latter.

@AriX
Copy link
Contributor Author

AriX commented Apr 19, 2024

However I think your specific solution may create illegal outputs, as if the topmost parser still has potential characters that can be inserted, we are not necessarily returning the correct results.

On 2nd thought, I'm not totally sure this is true. If the topmost parser's "can end" is true, then the JSONSchemaParser's get_allowed_characters() already includes the allowed characters of the next parser, without regard for whether or not the topmost parser has allowed characters left. It seems correct for the logic in shortcut_key() should match the logic in get_allowed_characters().

Let me know if I'm missing something.

@noamgat
Copy link
Owner

noamgat commented Apr 19, 2024

get_allowed_characters() not of the JSONSchemaParser, but of a specific object in its stack.
If the specific parser is "finished but not removed", it will have can_end=True and get_allowed_character=''

I've implemented it in this branch:
https://github.com/noamgat/lm-format-enforcer/tree/feature/escaping_performance

can you check if it solves the performance issue you are encountering?

@AriX
Copy link
Contributor Author

AriX commented Apr 19, 2024

get_allowed_characters() not of the JSONSchemaParser, but of a specific object in its stack. If the specific parser is "finished but not removed", it will have can_end=True and get_allowed_character=''

Yeah, I got that. I guess I understand you original point now, in that if the topmost parser has some characters that are allowed but wouldn't be allowed by json_freetext, we might prevent those from being generated.

can you check if it solves the performance issue you are encountering?

Yes, it will take me a few minutes as I don't use lm-format-enforcer directly but maintain a port of it on the side

@noamgat
Copy link
Owner

noamgat commented Apr 19, 2024

You can also see the diff here: d5dcdc5 and apply it locally

@AriX
Copy link
Contributor Author

AriX commented Apr 19, 2024

You can also see the diff here: d5dcdc5 and apply it locally

My port is in Swift so applying the patches takes some work :)
May open-source it some day.

You can also see the diff here: d5dcdc5 and apply it locally

This change doesn't work for me; in my tests it strips StringParserState off of the stack too early, preventing its parsedString from being read.

I can't reproduce this in the Python implementation. I suspect the reason is because my version has a stricter whitespace policy, which means after reading a string my string parser has 0 allowed characters where this implementation has whitespace in its allowed characters.

So, for my use case I think I need to go with the version that alters the shortcut_key() implementation to check for can_end() and empty get_allowed_characters().

But, of course I defer to you if you'd prefer to use an implementation like what you have in feature/escaping_performance here.

@noamgat
Copy link
Owner

noamgat commented Apr 20, 2024

You're right, added handling of the StringParser in this commit: 34032f6
Unit tests now pass

@noamgat
Copy link
Owner

noamgat commented Apr 20, 2024

I merged the holistic solution together with the string parser fix to master. Thank you so much for spotting this issue, the unit test suite now runs twice as fast in the CI! This is a great find. I will be closing this PR soon.

@AriX
Copy link
Contributor Author

AriX commented Apr 20, 2024

Thank you so much for spotting this issue, the unit test suite now runs twice as fast in the CI! This is a great find.

That's great - glad I could help!

@AriX AriX closed this Apr 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants